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We consider the problem of learning causal information between 
random variables in directed acyclic graphs (DAGs) when allowing 
arbitrarily many latent and selection variables. The FCI (Fast Causal 
Inference) algorithm has been explicitly designed to infer conditional 
independence and causal information in such settings. However, FCI 
is computationally infeasible for large graphs. We therefore propose 
the new RFCI algorithm, which is much faster than FCI. In some 
situations the output of RFCI is slightly less informative, in particular 
with respect to conditional independence information. However, we 
prove that any causal information in the output of RFCI is correct in 
the asymptotic limit. We also define a class of graphs on which the 
outputs of FCI and RFCI are identical. We prove consistency of FCI 
and RFCI in sparse high-dimensional settings, and demonstrate in 
simulations that the estimation performances of the algorithms are 
very similar. All software is implemented in the R-package pcalg. 



1. Introduction. We consider the problem of learning the causal struc- 
ture between random variables in acyclic systems with arbitrarily many 
latent and selection variables. As background information, we first discuss 
the situation without latent and selection variables in Section 1.1. Next, in 
Section 1.2 we discuss complications that arise when allowing for arbitrarily 
many latent and selection variables. Our new contributions are outlined in 
Section 1.3. 
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1.1. Systems without latent and selection variables. We first consider sys- 
tems that satisfy the assumption of causal sufficiency, that is, that there are 
no unmeasured common causes and no unmeasured selection variables. We 
assume that causal information between variables can be represented by 
a directed acyclic graph (DAG) in which the vertices represent random vari- 
ables and the edges represent direct causal effects (see, e.g., [13, 14, 20]). In 
particular, X\ is a direct cause of A2 only if X\ — > X2 (i.e., X\ is a parent 
of X2), and X\ is a (possibly indirect) cause of X2 only if there is a directed 
path from X\ to X2 (i.e., X\ is an ancestor of X2). 

Each causal DAG implies a set of conditional independence relationships 
which can be read off from the DAG using a concept called (f-separation [13]. 
Several DAGs can describe exactly the same conditional independence infor- 
mation. Such DAGs are called Markov equivalent and form a Markov equiv- 
alence class. For example, consider DAGs on the variables {Ai, X2, X3}. 
Then X\ — > A 2 — > A3, X\ «— A 2 «— A3 and Ai «— A 2 — > A3 form a Markov 
equivalence class, since they all imply the single conditional independence 
relationship X\ 1L A3jA2, that is, X\ is conditionally independent of A3 
given X2 (using the shorthand notation of Dawid [7]). Another Markov 
equivalence class is given by the single DAG X\ — > X2 A3 , since this is the 
only DAG that implies the conditional independence relationship X\ _IL A3 
alone. Markov equivalence classes of DAGs can be described uniquely by 
a completed partially directed acyclic graph (CPDAG) [3, 4]. 

CPDAGs can be learned from conditional independence information if one 
assumes faithfulness, that is, if the conditional independence relationships 
among the variables are exactly equal to those that are implied by the DAG 
via d-separation. For example, suppose that the distribution of {Ai, A2, A3} 
is faithful to an unknown underlying causal DAG, and that the only con- 
ditional independence relationship is Ai il_ A3IA2. Then the corresponding 
Markov equivalence class consists of Ai — >■ A2 — > A3 , X\ <— A2 4— A3 and 
X\ <— X2 —> A3 , and we know that one of these three DAGs must be the true 
causal DAG. Algorithms that are based on this idea are called constraint- 
based algorithms, and a prominent example is the PC algorithm [20]. The 
PC algorithm is sound (i.e., correct) and complete (i.e., maximally informa- 
tive) under the assumptions of causal sufficiency and faithfulness [20]. It is 
efficiently implemented in the R-package pcalg [9], and was shown to be 
asymptotically consistent in sparse high-dimensional settings [8]. 

In practice, one often wants to estimate not only the Markov equivalence 
class of DAGs, but also the size of causal effects between pairs of variables. 
In the special case that the estimated CPDAG represents a single DAG, one 
can do this via, for example, Pearl's do-calculus (also called intervention cal- 
culus; see [13]) or marginal structural models [18]. If the estimated CPDAG 
represents several DAGs, one can conceptually estimate causal effects for 
each DAG in the Markov equivalence class, and use these values to infer 
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bounds on causal effects. This idea, together with a fast local implementa- 
tion, forms the basis of the IDA algorithm [10, 11] which estimates bounds on 
causal effects from observational data that are generated from an unknown 
causal DAG (IDA stands for Intervention calculus when the DAG is Absent). 
The IDA algorithm was shown to be consistent in sparse high-dimensional 
settings [11], and was validated on a challenging high-dimensional yeast gene 
expression data set [10]. 

1.2. Complications arising from latent and selection variables. In prac- 
tice there are often latent variables, that is, variables that are not mea- 
sured or recorded. Statistically speaking, these variables are marginalized 
out. Moreover, there can be selection variables, that is, unmeasured vari- 
ables that determine whether or not a measured unit is included in the data 
sample. Statistically speaking, these variables are conditioned on (see [6, 21] 
for a more detailed discussion). Latent and selection variables cause several 
complications. 

The first problem is that causal inference based on the PC algorithm 
may be incorrect. For example, consider the DAG in Figure 1(a) with ob- 
served variables X = {Ai, X2, A3} and latent variables L = {Li,^}- The 
only conditional independence relationship among the observed variables is 
X\ _1L A3 . There is only one DAG on X that implies this single conditional 
independence relationship, namely X\ — > X2 <— A3, and this will therefore 
be the output of the PC algorithm; see Figure 1(b). This output would lead 
us to believe that both X\ and A3 are causes of X<i- But this is clearly in- 
correct, since in the underlying DAG with latent variables, there is neither 
a directed path from X\ to X2 nor one from A3 to A2. 

A second problem is that the space of DAGs is not closed under marginal- 
ization and conditioning [16] in the following sense. If a distribution is faith- 
ful to a DAG, then the distribution obtained by marginalizing out and con- 
ditioning on some of the variables may not be faithful to any DAG on the 
observed variables. For example, consider the DAG X\ — > X2 ^— L\ — > A3 «— 
A4 . This DAG implies the following set of conditional independence relation- 
ships among the observed variables X = {Ai, . . . , A4}: X\ _U_ A3, X\ _LL A4, 



CE) (5) CE) (^)~ *(**)*~ (2) 

(a) (b) 



x 2 
(c) 



Fig. 1. Graphs corresponding to the examples in Section 1.2. Throughout we use square 
boxes to represent latent variables and circles to represent observed variables, (a) DAG 
with latent variables; (b) CPDAG; (c) PAG. 
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X 2 JL X 4 , Xi JL A 3 |A 4 , Xi JL A 4 |A 2 , X 1 JL A 4 |A 3 and A 2 _U_ X 4 \X X , and 
others implied by these. There is no DAG on X that entails exactly this set 
of conditional independencies via d-separation. 

These problems can be solved by introducing a new class of graphs on 
the observed variables, called maximal ancestral graphs (MAGs) [16]. Every 
DAG with latent and selection variables can be transformed into a unique 
MAG over the observed variables ([16], page 981). Several DAGs can lead 
to the same MAG. In fact, a MAG describes infinitely many DAGs since no 
restrictions are made on the number of latent and selection variables. 

MAGs encode causal relationships between the observed variables via the 
edge marks. For example, consider the edge X\ — > X 2 in a MAG. The tail 
at X\ implies that X± is a cause (ancestor) of X 2 or of a selection variable, 
and the arrowhead at X 2 implies that X 2 is not a cause (not an ancestor) 
of X\ nor of any selection variable, in all possible underlying DAGs with 
latent and selection variables. Moreover, MAGs encode conditional inde- 
pendence relationships among the observed variables via m-separation [16], 
a generalization of d-separation (see Definition 2.1 in Section 2.2). Several 
MAGs can describe exactly the same conditional independence relation- 
ships; see [2]. Such MAGs form a Markov equivalence class which can be 
represented by a partial ancestral graph (PAG); see Definition 3.1. PAGs 
describe causal features common to every MAG in the Markov equivalence 
class, and hence to every DAG (possibly with latent and selection variables) 
compatible with the observable independence structure under the assump- 
tion of faithfulness. For example, consider again the DAG in Figure 1(a). The 
only conditional independence relationship among the observed variables is 
X\ JL A3, and this is represented by the PAG in Figure 1(c). This PAG 
implies that X 2 is not a cause (ancestor) of X%, A3 or a selection variable, 
and this is indeed the case in the underlying DAG in Figure 1(a) and is true 
of any DAG that, assuming faithfulness, could have implied X\ JL A3. The 
two circle marks at X\ and A3 in Figure 1(c) represent uncertainty about 
whether or not Ai and A3 are causes of X 2 . This reflects the fact that the 
single conditional independence relationship X\ JL A3 among the observed 
variables can arise from the DAG Ai — > X 2 «— A3 in which Ai and A3 are 
causes of X 2 , but it can also arise from the DAG in Figure 1 (a) in which Ai 
and A3 are not causes of X 2 . 

Under the faithfulness assumption, a Markov equivalence class of DAGs 
with latent and selection variables can be learned from conditional indepen- 
dence information among the observed variables alone using the Fast Causal 
Inference (FCI) algorithm [20], which is a modification of the PC algorithm. 
Originally, the output of FCI was defined as a partially oriented inducing 
path graph (POIPG), but its output can also be interpreted as a PAG [23]. 
Spirtes et al. [20] proved that the FCI algorithm is sound in the presence of 
arbitrarily many latent variables. Spirtes et al. [21] extended the soundness 
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proof to allow for selection variables as well. Zhang [24] recently introduced 
extra orientation rules that make FCI complete when its output is inter- 
preted as a PAG. Despite its name, FCI is computationally very intensive 
for large graphs. 

Spirtes [19] introduced a modified version of FCI, called Anytime FCI, 
that only considers conditional independence tests with conditioning sets of 
size less than some prespecified cut-off K. Anytime FCI is typically faster 
but less informative than FCI, but the causal interpretation of tails and 
arrowheads in its output is still sound. 

Some work on the estimation of the size of causal effects in situations 
with latent and selection variables can be found in [17, 23] and in Chapter 7 
of [20]. 

1.3. New contributions. We introduce a new algorithm for learning PAGs, 
called the Really Fast Causal Inference (RFCI) algorithm (see Section 3.2). 
RFCI uses fewer conditional independence tests than FCI, and its tests con- 
dition on a smaller number of variables. As a result, RFCI is much faster 
than FCI and its output tends to be more reliable for small samples, since 
conditional independence tests of high order have low power. On the other 
hand, the output of RFCI may be less informative. In this sense, the algo- 
rithm is related to the Anytime FCI algorithm [19]. 

In Section 3.4 we compare the outputs of FCI and RFCI, and define a class 
of graphs for which the outputs of FCI and RFCI are identical. 

In Section 4 we prove consistency of FCI and RFCI in sparse high- 
dimensional settings. The sparsity conditions needed for consistency of FCI 
are stronger than those for RFCI, due to the higher complexity of the FCI 
algorithm. 

In order to compare RFCI to existing algorithms, we propose several 
small modifications of FCI and Anytime FCI. In particular, we introduce 
the Adaptive Anytime FCI (AAFCI) algorithm (see Section 3 of the sup- 
plementary document [5]) and we propose several ways to speed up the FCI 
and AAFCI algorithms (see Section 3.1). 

We show in simulations (see Section 5) that the numbers of errors made 
by all algorithms are very similar. Moreover, we show that our modifications 
of FCI and AAFCI shorten the computation time considerably, but that for 
large graphs, RFCI is the only feasible algorithm. 

All proofs, a description of AAFCI, pseudocodes and two additional exam- 
ples are given in the supplementary document [5] . The R-package pcalg [9] 
contains implementations of all algorithms. 

2. Preliminaries. This section introduces terminology that is used through- 
out the paper. Section 2.1 defines various graphical concepts, and Section 2.2 
describes how graphs can be interpreted probabilistically and causally. 
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2.1. Graphical definitions. A graph Q = (V,E) is composed of a set of 
vertices V = {X±, . . . , X p } and a set of edges E. In our framework, the ver- 
tices represent random variables and the edges describe conditional inde- 
pendence and ancestral relationships. The edge set E can contain (a sub- 
set of) the following six types of edges: — > (directed), -f-> (bi- directed), — 
(undirected), m (nondirected) , <>— (partially undirected) and o— >■ (partially 
directed). The endpoints of an edge are called marks and they can be tails, 
arrowheads or circles. We use the symbol "*" to denote an arbitrary edge 
mark. A graph containing only directed edges is called directed, and one 
containing only undirected edges is called undirected. A mixed graph can 
contain directed, bi-directed and undirected edges. If we are only interested 
in the presence and absence of edges in a graph and not in the edge marks, 
we refer to the skeleton of the graph. 

All the graphs we consider are simple in that there is at most one edge 
between any two vertices. If an edge is present, the vertices are said to be 
adjacent. If all pairs of vertices in a graph are adjacent, the graph is called 
complete. The adjacency set of a vertex Aj in a graph Q is the set of all 
vertices in V \ {Aj} that are adjacent to Aj in Q, denoted by adj(£, Aj). 
A vertex Xj in adj(<7, Aj) is called a parent of Aj if Xj — > Aj, a child of Aj 
if Aj — > Xj , a spouse of Aj if Aj o Xj , and a neighbor of Aj if Aj — Xj . The 
corresponding sets of parents, children, spouses and neighbors are denoted 
by pa(<7, Aj), ch(Q, Aj), sp(Q, Aj) and ne(Q, Aj), respectively. 

A path is a sequence of distinct adjacent vertices. A path (Aj, Aj, . . . , A&) 
is said to be out of (into) Aj if the edge between Aj and Xj has a tail 
(arrowhead) at Aj . A directed path is a path along directed edges that follows 
the direction of the arrowheads. A cycle occurs when there is a path from Aj 
to Xj and Aj and Xj are adjacent. A directed path from Aj to Xj forms 
a directed cycle together with the edge Xj — > Aj, and it forms an almost 
directed cycle together with the edge Xj <-> Aj . If there is a directed path tt 
from Aj to Xj or if Aj = Xj , the vertex Aj is called an ancestor of Xj and Xj 
a descendant of Aj. The sets of ancestors and descendants of a vertex Aj 
in Q are denoted by an(C?, Aj) and de(Q, Aj), respectively. These definitions 
are applied to a set Y C V of distinct vertices as follows: 

an(<7, Y) = {AjjAj € &n(Q,Xj) for some Xj € Y}; 
de(G, Y) = {AjjAj € de(Q,Xj) for some Xj € Y}. 

Three vertices that form a cycle are called a triangle. Three vertices 
(Aj,Aj,Afc) are called an unshielded triple if Aj and Xj are adjacent, Xj 
and Afc are adjacent, but Aj and X^ are not adjacent. A nonendpoint ver- 
tex Xj on a path tt is a collider on the path if both the edges preceding and 
succeeding it have an arrowhead at Xj, that is, if the path contains =fH>Aj-f-*. 
A nonendpoint vertex Xj on a path ir which is not a collider is a noncollider 
on the path. An unshielded triple (Aj, Aj, A^.) is called a v-structure if Xj 
is a collider on the path (Aj, Xj, X^). 
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A path 7r = (Xi, . . . , Xj,Xb, X p ) in a mixed graph is called a discriminating 
path for Xi, if the following three conditions hold: (i) tt includes at least three 
edges; (ii) X\, is a nonendpoint vertex on tt and is adjacent to X p on tt; and 
(iii) Xi is not adjacent to X p in the graph and every vertex between X\ 
and Xb is a collider on tt and a parent of X p . An example of a discriminating 
path is given in Figure 4 of [5], where the circle marks are replaced by stars. 

A graph Q = (V, E) is called connected if there exists a path between any 
pair of vertices in V. A graph is called biconnected if it is connected and 
remains so if any vertex and its incident edges were to be removed. A bicon- 
nected component of a graph is a maximally biconnected subgraph [1]. 

A directed graph Q = (V, E) is called a directed acyclic graph (DAG) if 
it does not contain directed cycles. A mixed graph Q = (V, E) is called an 
ancestral graph if (i) it does not contain directed cycles, (ii) it does not 
contain almost directed cycles, and (iii) for any undirected edge Xi — Xj 
in E, Xi and Xj have no parents or spouses. DAGs form a subset of ancestral 
graphs. 

2.2. Probabilistic and causal interpretation of graphs. A DAG entails 
conditional independence relationships via a graphical criterion called d- 
separation, which is a special case of m-separation: 

Definition 2.1 (Richardson and Spirtes [16]). A path tt in an ancestral 
graph is said to be blocked by a set of vertices Y if and only if: 

(i) tt contains a subpath (Xi,Xj,Xk) such that the middle vertex Xj is 
a noncollider on this path and Xj £ Y, or 

(ii) tt contains a v-structure Xi*-tXj<-*Xk such that Xj £ Y and no 
descendant of Xj is in Y. 

Vertices Z and W are m-separated by Y if every path tt between Z and W 
is blocked by Y. Sets of vertices Z and W are m-separated by Y if all pairs 
of vertices 2sZ, W G W are m-separated by Y. 

If two vertices Xi and Xj in a DAG Q are ci-separated by a subset Y of the 
remaining vertices, then Xi ALXj\Y in any distribution Q that factorizes 
according to Q (i.e., the joint density can be written as the product of the 
conditional densities of each variable given its parents in Q: q(X\, . . . , X p ) = 
Y\? =1 q(Xi\p&(G, Xi)). A distribution Q is said to be faithful to a DAG Q if 
the reverse implication also holds, that is, if the conditional independence 
relationships in Q are exactly the same as those that can be inferred from Q 
using d-separation. A set Y that d-separates Xi and Xj in a DAG is called 
a minimal separating set if no subset of Y d-separates Xi and Xj. A set Y is 
a minimal separating set for Xi and Xj given S if Xi and Xj are (i-separated 
by Y U S and there is no subset Y' of Y such that Xi and Xj are d-separated 
by Y'US. 
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When a DAG Q = (V,E) contains latent and selection variables, we write 
V = XULUS, where X represents the observed variables, L represents the 
latent variables and S represents the selection variables, and these sets are 
disjoint (i.e., U denotes the union of disjoint sets). 

A maximal ancestral graph (MAG) is an ancestral graph in which ev- 
ery missing edge corresponds to a conditional independence relationship. 
Richardson and Spirtes ([16], page 981) give an algorithm to transform 
a DAG G = (XULUS, E) into a unique MAG G* as follows. Let Q* have 
vertex set X. For any pair of vertices Xj, Xj € X make them adjacent in Q* 
if and only if there is an inducing path (see Definition 3.5) between Xj 
and Xj in Q relative to X given S. Moreover, for each edge Xj*-*Xj in Q* 
put an arrowhead at Xi if Xj ^ an(t/,{Xj} U S) and put a tail otherwise. 
The resulting MAG Q* = (X, E*) encodes the conditional independence rela- 
tionships holding in Q among the observed variables X conditional on some 
value for the selection variables S = s; thus if Xi and Xj are m-separated 
by Y in Q* , then Xi and Xj are (i-separated by Y U S in Q and hence 
Xi _LL Ajj(Y U {S = s}) in any distribution Q factorizing according to Q. 
Perhaps more importantly, an ancestral graph preserves the ancestral rela- 
tionships encoded in the DAG. 

Throughout the remainder of this paper, S refers to either the set of 
variables S or the event S = s, depending on the context. 

3. Oracle versions of the algorithms. We consider the following problem: 
assuming that the distribution ofV = XULUS is faithful to an unknown 
underlying causal DAG Q = (V,E), and given oracle information about all 
conditional independence relationships between pairs of variables Xi and Xj 
in X given sets Y U S where Y C X \ { Xi , Xj } , we want to infer information 
about the ancestral (causal) relationships of the variables in the underlying 
DAG, which we represent via a PAG. 

We discuss and compare two algorithms for this purpose, the FCI algo- 
rithm and our new RFCI algorithm. We first define the outputs of both 
algorithms: an FCI-PAG and an RFCI-PAG. (An FCI-PAG is usually re- 
ferred to simply as a "PAG," but in the remainder of this paper we use the 
name FCI-PAG to make a clear distinction between the output of the two 
algorithms.) 

Definition 3.1. Let Q be a DAG with partitioned vertex set X U L U S. 
Let C be a simple graph with vertex set X and edges of the type — >, 
o-o, — or o — . Then C is said to be an FCI-PAG that represents Q if and 
only if, for any distribution P of X U L U S that is faithful to Q, the following 
four conditions hold: 

(i) the absence of an edge between two vertices Xi and Xj in C implies 
that there exists a subset Y C X \ {Xi , Xj } such that Xi _LL Xj \ (Y US) in P; 
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(ii) the presence of an edge between two vertices Xi and Xj in C implies 
that Xi JlXj\(Y US) in P for all subsets YCX\{X i ,I j }; 

(iii) if an edge between X{ and Xj in C has an arrowhead at Xj, then 
Xj <£an(g,XiUS); 

(iv) if an edge between Xi and Xj in C has a tail at Xj, then Xj € 
an(£,A;US). 

Definition 3.2. Let ^ be a DAG with partitioned vertex set XULUS. 
Let C be a simple graph with vertex set X and edges of the type — >, o— 
f->-, — , or o — . Then C is said to be an RFCI-PAG that represents Q if and 
only if, for any distribution P of X U L U S that is faithful to Q, conditions (i), 
(iii) and (iv) of Definition 3.1 and the following condition hold: 

(ii') the presence of an edge between two vertices Xi and Xj in C implies 
that Xi JlXj | (Y U S) for all subsets Y C adj(C, Xi) \ {Xj } and for all subsets 
YCadj(C,Xj)\{Xi}. 

Condition (ii) in Definition 3.1 is stronger than condition (ii') in Defi- 
nition 3.2. Hence, the presence of an edge in an RFCI-PAG has a weaker 
interpretation than in an FCI-PAG. This has several consequences. First, ev- 
ery FCI-PAG is an RFCI-PAG. Second, different RFCI-PAGs for the same 
underlying DAG may have different skeletons, while the FCI-PAG skeleton is 
unique. In general, the RFCI-PAG skeleton is a supergraph of the FCI-PAG 
skeleton. Finally, an RFCI-PAG can correspond to more than one Markov 
equivalence class of DAGs (see Example 2 in Section 3.3). 

It is worth noting that every FCI-PAG is an RFCI-PAG. Moreover, for 
a given pair of a graph Q and a distribution P faithful to it, there may be two 
different FCI-PAGs that represent Q but they will have the same skeleton. 
On the other hand, for a given pair of a graph Q and a distribution P faithful 
to it, there may also be more than one RFCI-PAG that represents Q and 
these different RFCI-PAGs can also have different skeletons. 

The remainder of this section is organized as follows. Section 3.1 briefly 
discusses the FCI algorithm and proposes modifications that can speed up 
the algorithm while remaining sound and complete. Section 3.2 introduces 
our new RFCI algorithm. Section 3.3 discusses several examples that illus- 
trate the commonalities and differences between the two algorithms, and 
Section 3.4 defines a class of graphs for which the outputs of FCI and RFCI 
are identical. 

3.1. The FCI algorithm. A high-level sketch of FCI ([20], pages 144 
and 145) is given in Algorithm 3.1. The sub-algorithms 4.1-4.3 are given 
in [5]. 

The determination of adjacencies in the PAG within the FCI algorithm is 
based on the following fact: if Xi is not an ancestor of Xj, and Xi and Xj 
are conditionally independent given some set Y U S where Y C X\ {Xi, Xj}, 
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Algorithm 3.1 The FCI algorithm 

Require: Conditional independence information among all variables in X given S 
1: Use Algorithm 4.1 of [5] to find an initial skeleton (C), separation sets (sepset) and 

unshielded triple list (2JI); 
2: Use Algorithm 4.2 of [5] to orient v-structures (update C); 
3: Use Algorithm 4.3 of [5] to find the final skeleton (update C and sepset); 
4: Use Algorithm 4.2 of [5] to orient v-structures (update C); 

5: Use rules (R1)-(R10) of [24] to orient as many edge marks as possible (update C); 
6: return C, sepset. 



then Xi and Xj are conditionally independent given Y'uS for some sub- 
set Y' of acertain set D-SEP(X i ,X i ) or of D-SEP(X,-,Xi) (see [20], page 134 
for a definition). This means that, in order to determine whether there is an 
edge between Xi and Xj in an FCI-PAG, one does not need to test whether 
Xi _lLXj|(YUS) for all possible subsets YCX\{J i) J j }, but only for all 
possible subsets Y C D-SEP(Xj, Xj) and Y C D-SEP(X j , Xi). Since the sets 
D-SEP(Xj, Xj) cannot be inferred from the observed conditional indepen- 
dencies, Spirtes et al. [20] defined a superset, called Possible-D-SEP, that 
can be computed: 

Definition 3.3. Let C be a graph with any of the following edge types: 
o-o, o— >■, f-K Possible-D-SEP [Xi^Xj) in C, denoted in shorthand by pds(C, 
Xi,Xj), is defined as follows: X^ G pds(C, Xj, Xj) if and only if there is 
a path 7r between Xi and X& in C such that for every subpath {X m ,Xi,Xh) 
of 7r, Xi is a collider on the subpath in C or (X m ,X/,X/j) is a triangle in C. 

Remark 3.1. Note that Xj does not play a role in the definition of 
pds(C, Xi,Xj), but we keep it as an argument because we will later con- 
sider alternative definitions of Possible-D-SEP (see Definition 3.4) where the 
second vertex Xj does play a role. 

Since the definition of Possible-D-SEP requires some knowledge about 
the skeleton and orientation of edges, the FCI algorithm first finds an initial 
skeleton denoted by C\ in Step 1. This is done as in the PC-algorithm, by 
starting with a complete graph with edges and performing conditional 
independence tests given subsets of increasing size of the adjacency sets of 
the vertices. An edge between Xi and Xj is deleted if a conditional indepen- 
dence is found, and the set responsible for this conditional independence is 
saved in sepset (Xj, Xj) and sepset (Xj,Xj) (see Algorithm 4.1 of [5]). The 
skeleton after completion of Step 1 is a superset of the final skeleton. 

In Step 2, the algorithm orients unshielded triples Xi *-o Xj °-* Xk as v- 
structures Xj^Xj^X^ if and only if Xj is not in sepset (Xj, Xk) and 
sepset (Xfc,Xj) (see Algorithm 4.2 of [5]). 

The graph resulting after Step 2, denoted by C2, contains sufficient in- 
formation to compute the Possible-D-SEP sets. Thus, in Step 3, the algo- 
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rithm computes pds(C2, Xi, •) for every Xi € X. Then for every element Xj in 
adj (C2 ,Xi), the algorithm tests whether Xi JL Xj | ( Y U S) for every subset Y 
of pds(C 2 , X u •) \ {Xi, Xj} and of pds(C 2 , Xj, ■) \ {Xj,Xij (see Algorithm 4.3 
of [5]). As in Step 1, the tests are arranged in a hierarchical way starting 
with conditioning sets of small size. If there exists a set Y that makes Xi 
and Xj conditionally independent given Y U S, the edge between Xi and Xj 
is removed and the set Y is saved as the separation set in sepset(Aj, Xj) 
and sepset(X j,Xi). After all conditional independence tests are completed, 
every edge in C is reoriented as 0-0 , since the orientation of v-structures 
in Step 2 of the algorithm cannot necessarily be interpreted as specified in 
conditions (iii) and (iv) of Definition 3.1. 

In Step 4, the v-structures are therefore oriented again based on the up- 
dated skeleton and the updated information in sepset (see Algorithm 4.2 
of [5]). Finally, in Step 5 the algorithm replaces as many circles as possible 
by arrowheads and tails using the orientation rules described by [24]. 

First proposed modification: FCI p3b th- F° r sparse graphs, Step 3 of the 
FCI algorithm dramatically increases the computational complexity of the 
algorithm when compared to the PC algorithm. The additional computa- 
tional effort can be divided in two parts: computing the Possible-D-SEP sets, 
and testing conditional independence given all subsets of these sets. The lat- 
ter part is computationally infeasible when the sets pds(C2, Xi, •) are large, 
containing, say, more than 30 vertices. Since the size of the Possible-D-SEP 
sets plays such an important role in the complexity of the FCI algorithm, 
and since one has some freedom in defining these sets (they simply must be 
supersets of the D-SEP sets), we first propose a modification of the definition 
of Possible-D-SEP that can decrease its size. 

Definition 3.4. Let C be a graph with any of the following edge types: 
0-0, o-^ o-. Then, for two vertices Xi and Xj adjacent in C, pds path (C, X^Xj) 
is defined as follows: X^ £ pds path (C, Xi, Xj) if and only if (i) there is a path ir 
between Xi and X^ in C such that for every subpath {X m , X[,Xh) of tt, Xi is 
a collider on the subpath in C or (X m ,Xi,Xf l ) is a triangle in C, and (ii) X^ 
lies on a path between Xi and Xj. 

For any pair of adjacent vertices Xi and Xj in a graph C, the set pds path (C, 
Xi, Xj) can be computed easily by intersecting pds(C, Xi, ■) with the unique 
biconnected component in C that contains the edge between Xi and Xj. 
Algorithm 4.3 of [5] can now be modified as follows. Before line 1, we compute 
all biconnected components of the graph C 2 , where C2 is the graph resulting 
from Step 2 of the FCI algorithm. Then between lines 3 and 4, we compute 
pd 

s path (^2 j Xi , Xj ) as described above. Finally, on lines 8, 13 and 14, we 
replace pds(C2,Aj,-) by pds path (C2,Xi,Xj). We refer to the FCI algorithm 
with this modified version of Algorithm 4.3 of [5] as FCI pat h- 
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Second class of modifications: CFCI, CFCI pa- th, SCFCI and SCFCI pa ,th- 
Another possibility to decrease the size of Possible- D-SEP is to use conserva- 
tive rules to orient v-structures in Step 2 of the FCI algorithm, so that fewer 
arrowheads are introduced, similarly to the Conservative PC algorithm [15]. 
This is especially helpful in the sample version of the algorithm (see Sec- 
tion 4.1), as the sample version tends to orient too many v-structures, which 
can lead to long chains of bi-directed edges and hence large Possible-D-SEP 
sets (see Figure 6 in Section 5.3). 

The conservative orientation works as follows. For all unshielded triples 
(Xi,Xj,Xk) in C\, where C\ is the graph resulting from Step 1 of the FCI 
algorithm, we determine all subsets Y of adj(Ci,Xj) and of adj(Ci,Xfc) sat- 
isfying X{ _LL A&|(Y U S). We refer to these sets as separating sets, and we 
label the triple (Xi,Xj,Xk) as unambiguous if and only if (i) at least one 
separating set Y is found and either Xj is in all separating sets and in 
sepset(Xj, Xk) or Xj is in none of the separating sets nor in sepset(Xj, X&), 
or (ii) no such separating set Y is found. [Condition (ii) can occur, since sep- 
arating sets found in Step 1 of the FCI algorithm do not need to be a subset 
of adj(Ci,-Xj) or of adj(Ci, X^).] At the end of Step 2, we only orient un- 
ambiguous triples satisfying Xj ^ sepset(Xj, X^j as v-structures. This may 
lead to different Possible-D-SEP sets in Step 3 (even in the oracle version 
of the algorithm), but other than that, Steps 3-5 of the algorithm remain 
unchanged. We refer to this version of the FCI algorithm as Conservative 
FCI (CFCI). If CFCI is used in combination with pds path , we use the name 

CFCIpath- 

Finally, the idea of conservative v-structures can also be applied in Step 4 
of the FCI algorithm. For each unshielded triple (Xj, Xj, X^) in C3, where C3 
is the graph resulting from Step 3, we determine all subsets Y of adj(C3, Xj) 
and of adj(C3, X&) satisfying Xj _LL A^|(Y US). We then determine if a triple 
is unambiguous, and only if this is the case we orient it as v-structure or 
non-v-structure. Moreover, the orientation rules in Step 5 of the algorithm 
are adapted so that they only rely on unambiguous triples. We use the name 
Superconservative FCI (SCFCI) to refer to the version of FCI that uses con- 
servative v-structures in both Steps 2 and 4. If SCFCI is used in combination 
with pds path , we use the name SCFCI pat h. The proof of Theorem 3.1 shows 
that the output of the oracle version of SCFCI is identical to that of CFCI. 
We still consider this version, however, in the hope to obtain better edge 
orientations in the sample versions of the algorithms, where the outputs are 
typically not identical. 

Soundness of FCI follows from Theorem 5 of [21]. Soundness results for 
the modifications FCI path , CFCI, CFCI path , SCFCI and SCFCI path are given 
in the following theorem: 

Theorem 3.1. Consider one of the oracle versions of FC7 pa th, CFCI, 
CFCI path , SCFCI or SCFCI path . Let the distribution o/V = XULOS be 
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faithful to a DAG Q and let conditional independence information among 
all variables in X given S be the input to the algorithm. Then the output of 
the algorithm is an FCI-PAG of Q. 

Completeness of FCI was proved by [24] . This means that the output of 
FCI is maximally informative, in the sense that for every circle mark there 
exists at least one MAG in the Markov equivalence class represented by the 
PAG where the mark is oriented as a tail, and at least one where it is oriented 
as an arrowhead. Completeness results of FCI pat h, CFCI, CFCI pat h, SCFCI 
and SCFCIp a th follow directly from the fact that, in the oracle versions, the 
orientation rules of these modifications boil down to the orientation rules of 
FCI. 

3.2. The RFCI algorithm. The Really Fast Causal Inference (RFCI) al- 
gorithm is a modification of FCI. The main difference is that RFCI avoids 
the conditional independence tests given subsets of Possible-D-SEP sets, 
which can become very large even for sparse graphs. Instead, RFCI per- 
forms some additional tests before orienting v-structures and discriminating 
paths in order to ensure soundness, based on Lemmas 3.1 and 3.2 below. 
The number of these additional tests and the size of their conditioning sets 
is small for sparse graphs, since RFCI only conditions on subsets of the adja- 
cency sets. As a result, RFCI is much faster than FCI for sparse graphs (see 
Section 5.3). Moreover, the lower computational complexity of RFCI leads 
to high-dimensional consistency results under weaker conditions than FCI 
[compare conditions (A3) and (A3') in Sections 4.2 and 4.3]. A high-level 
sketch of RFCI is given in Algorithm 3.2. 

Step 1 of the algorithm is identical to Step 1 of Algorithm 3.1, and is 
used to find an initial skeleton C\ that satisfies conditions (i) and (ii') of 
Definition 3.2. 

In Step 2 of the algorithm, unshielded triples are oriented based on Lem- 
ma 3.1 and some further edges may be removed. 

Lemma 3 . 1 (Unshielded triple rule) . Let the distribution o/V = XULLJS 
be faithful to a DAG Q. Assume that (al) is a minimal separating set 



Algorithm 3.2 The RFCI algorithm 

Require: Conditional independence information among all variables in X given S 
1: Use Algorithm 4.1 of [5] to find an initial skeleton (C), separation sets (sepset) and 

unshielded triple list (SOI); 
2: Use Algorithm 4.4 of [5] to orient v-structures (update C and sepset); 
3: Use Algorithm 4.5 of [5] to orient as many edge marks as possible (update C and 

sepset); 
4: return C, sepset. 
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for Xi and Xk given S, and (a2) Xj and Xj as well as Xj and Xk are con- 
ditionally dependent given (Sik \ {Xj}) U S. Then Xj 6 an(Q, {Xj,Xfc} U S) 
if and only if Xj £ Sjfe . 

The details of Step 2 are given in Algorithm 4.4 of [5]. We start with 
a list 97t of all unshielded triples in C±, where C\ is the graph resulting from 
Step 1 of the RFCI algorithm, and an empty list £ that is used to store 
triples that were found to satisfy the conditions of Lemma 3.1. For each 
triple {Xi,Xj,Xk) in 971, we check if both Xi and Xj and Xj and Xk are 
conditionally dependent given (sepset(Xj, Xk) \ {Xj}) U S. These conditional 
dependencies may not have been checked in Step 1 of the algorithm, since 
sepset(Xj, Xk) \{Xj} does not need to be a subset of adj(Ci, Xj). If both con- 
ditional dependencies hold, the triple satisfies the conditions of Lemma 3.1 
and is added to £. On the other hand, an additional conditional independ- 
ence relationship may be detected, say Xi JL Xj|((sepset(Xj, Xk) \ {Xj}) U S) 
This may arise in a situation where X, and Xj are not m-separated given 
a subset of vertices adjacent to Xi, and are not m-separated given a subset 
of vertices adjacent to Xj, but they do happen to be m-separated given 
the set (sepset(Xj, X&) \ {Xj}) U S. In this situation, we remove the edge 
Xj*-*Xj from the graph, in agreement with condition (i) of Definition 3.2. 
The removal of this edge can create new unshielded triples, which are added 
to 971. Moreover, it can destroy unshielded triples in £ and 971, which are 
therefore removed. Finally, by testing subsets of the conditioning set which 
led to removal of the edge, we find a minimal separating set for Xi and Xj 
and store it in sepset(Xj, Xj) and sepset(Xj, Xi). Example 1 of [5] shows 
that it is not sufficient to simply store sepset(Xj, X^) \ {Xj} since it may 
not be minimal for Xi and Xj. We work with the lists 97T and £ to en- 
sure that the result of Step 2 does not depend on the order in which the 
unshielded triples are considered. 

After Step 2, all unshielded triples still present in the graph are correctly 
oriented as a v-structure or non-v-structure. In Step 3, the algorithm orients 
as many further edges as possible, as described in Algorithm 4.5 of [5]. This 
procedure consists of repeated applications of the orientation rules (Rl)- 
(R10) of [24], with the difference that rule (R4) about the discriminating 
path has been modified according to Lemma 3.2. 

Lemma 3.2 (Discriminating path rule). Let the distribution of V = 
XUL U S be faithful to a DAG Q . Let iiik = (Xi, . . . , Xi, Xj, Xk) be a sequence 
of at least four vertices that satisfy: (al) Xi and Xk are conditionally inde- 
pendent given Sik U S, (a2) any two successive vertices Xh and X^+i on TXik 
are conditionally dependent given (Y' \ {X^, X^+i}) U S for all Y' C Sik, 
(a3) all vertices Xh between Xi and Xj (not including Xi and Xj) satisfy 
X h G an(g,X k ) and X h £ an(£, {X h ^,X h+1 } U S), where X h _i and X h+1 
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denote the vertices adjacent to Xh on mk- Then the following hold: (bl) if 
Xj € Sik, then Xj € an((/, {Xk} U S) and Xk ^ an(t/, {Xj} U S) ; and (b2) if 
Xj <£ S ik , then Xj £ an(£, {X h X k } U S) and X k £ an(£, {Xj} U S). 

Lemma 3.2 is applied as follows. For each triangle (X/,Xj,Xfc) of the form 
Xj^fXk, Xj^Xi and Xi — > X^, the algorithm searches for a discriminating 
path 7r = (Xi, . . . , X, Xj, Xk) for of minimal length, and checks that the 
vertices in every consecutive pair (X r ,X q ) on ir are conditionally dependent 
given Y U S for all subsets Y of sepset(Xj, X&) \ {X r ,X q }. (Example 2 of [5] 
shows why it is not sufficient to only check conditional dependence given 
(sepset(Xj, Xk) \ {X r ,X q }) U S, as we did for the v-structures.) If we do 
not find any conditional independence relationship, the path satisfies the 
conditions of Lemma 3.2 and is oriented as in rule (R4) of [24]. If one or 
more conditional independence relationships are found, the corresponding 
edges are removed, their minimal separating sets are stored, and any new 
unshielded triples that are created by removing the edges are oriented using 
Algorithm 4.4 of [5]. We note that the output of Step 3 may depend on the 
order in which the discriminating paths are considered. 

Soundness of RFCI is stated in the following theorem. 

Theorem 3.2. Let the distribution o/ V = XULUS be faithful to 
a DAG Q and let conditional independence information among all variables 
in X given S be the input to the RFCI algorithm. Then the output of RFCI 
is an RFCI-PAG of Q . 

Remark 3.2. The new orientation rules based on Lemmas 3.1 and 3.2 
open possibilities for different modifications of the FCI algorithm. For exam- 
ple, one could replace pds(C,Xj,Xj) by pds fc (C,Xj,Xj), where a vertex Xi 
is in pds fc (C, Xj, Xj) if it is in pds(C,Xj,Xj) and there is a path between Xj 
and X\ containing no more than k + 1 vertices. This modification yields 
a skeleton that is typically a superset of the skeleton of the true FCI-PAG. 
In order to infer correct causal orientations based on this skeleton, one needs 
to use Algorithms 4.4 and 4.5 of [5] to determine the final orientations of 
the edges. The parameter k represents a trade-off between computing time 
and informativeness of the output, where k = 1 corresponds to the RFCI 
algorithm and k = |Xj — 2 corresponds to the FCI algorithm. 

Another way to obtain a more informative but slower version of RFCI 
can be obtained by modifying Step 1 of the RFCI algorithm: instead of 
considering all subsets of adj(C,Xj) and of adj (C,Xj), one can consider all 
subsets of the union adj(C,Xj) U adj (C,Xj). 

3.3. Examples. We now illustrate the algorithms in two examples. In 
Example 1, the outputs of FCI and RFCI are identical. In Example 2, the 
outputs of FCI and RFCI are not identical, and the output of RFCI describes 
two Markov equivalence classes. We will see, however, that the ancestral or 
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(a) (b) (c) 

Fig. 2. Graphs corresponding to Example 1, where the outputs of FCI and RFCI are 
identical, (a) Underlying DAG with latent variables; (b) initial skeleton C\; (c) RFCI-PAG 
and FCI-PAG. 

causal information inferred from an RFCI-PAG is correct. Two additional 
examples illustrating details of Algorithms 4.4 and 4.5 of [5] are given in 
Section 5 of [5]. 

Example 1. Consider the DAG in Figure 2(a) containing observed 
variables X = {Ai, . . . , Aq}, latent variables L = {Li,L 2 } and no selection 
variables (S = 0). Suppose that all conditional independence relationships 
over X that can be read off from this DAG are used as input for the algo- 
rithms. 

In all algorithms, Step 1 is the same, and consists of finding an initial 
skeleton. This skeleton, denoted by C\, is shown in Figure 2(b). The final 
output given by both algorithms is shown in Figure 2(c). 

Comparing the initial skeleton with the final skeleton, we see that the edge 
Xi 0-0X5 is present in the initial but not in the final skeleton. The absence 
in the final skeleton is due to the fact that X± il_ A5|{A 2 , A3, A4}. The edge 
is present in the initial skeleton, since this conditional independence is not 
found in Step 1 of the algorithms, because {A2, A3, A4} is not a subset of 
adj(Ci, Ai) nor of adj(Ci, A 5 ). 

The FCI algorithm finds the conditional independence relationship X± J_ 
_L A5HA2, A3, A4} in Step 3 when subsets of Possible-D-SEP are consid- 
ered, since pds(C 2 , X\, A 5 ) \ {A x , A 5 } = {A 2 , A 3 , A 4 } and pds(C 2 , A 5 , A x ) \ 
{A5, Ai} = {A 2 , A3, A4, Xq}, where C 2 is the graph resulting from Step 2 of 
the algorithm. 

In the RFCI algorithm, the conditional independence relationship Ai il_ A5 
{A 2 , A3,A4} is also found, but by another mechanism. In Step 2 of Al- 
gorithm 3.2, unshielded triples are oriented after performing some addi- 
tional conditional independence tests. In particular, when considering the 
triple ( Ai , A5 , Ag) , the algorithm checks whether Ai _LL A5 1 (sepset ( Ai , Ag ) \ 
{A 5 }), where sepset(A 1 , A 6 ) = {A 2 , A 3 , A 4 }. 
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(a) (b) (c) 

Fig. 3. Graphs corresponding to Example 2, where the outputs of FCI and RFCI are 
not identical. The output of RFCI corresponds to two Markov equivalence classes when 
interpreted as an RFCI-PAG. (a) Underlying DAG Q with latent variables; (b) output of 
RFCI for Q; (c) output of FCI for Q. 

This example also shows why it is necessary to check unshielded triples 
according to Lemma 3.1 before orienting them as v-structures. Omitting 
this check for triple {Xi,X§,Xq} would orient it as a v-structure, since A5 ^ 
sepset(Ai, Xq). Hence, we would conclude that A5 an(<?, {Xq} U S), which 
contradicts the underlying DAG. 

Finally, we see that the orientations of the edges are identical in the 
outputs of both algorithms, which implies that the outputs encode the same 
ancestral information. 

Example 2. Consider the DAG Q in Figure 3(a), containing observed 
variables X = {Xx, . . . , A5}, latent variables L = {L±,L2} and no selection 
variables (S = 0) (see also [21], page 228, Figure 8a). Suppose that all con- 
ditional independence relationships over X that can be read off from this 
DAG are used as input for the algorithms. 

The outputs of the RFCI and FCI algorithms are shown in Figure 3(b) 
and (c), respectively. We see that the output of RFCI contains an extra 
edge, namely X\ -H> A5. 

As in Example 1, this edge is present after Step 1 of both algorithms. The 
reason is that the conditional independence X± il_ A5HA2, A3, A4} is not 
found, because {A2, A3, A4} is not a subset of adj(Ci, Ai) nor of adj(Ci, A5), 
where C\ denotes the skeleton after Step 1. 

The FCI algorithm finds this conditional independence in Step 3 when 
subsets of Possible-D-SEP are considered. The RFCI algorithm does not 
find this conditional independence, since the edge between Ai and A5 does 
not appear in an unshielded triple or a discriminating path. However, the 
ancestral information encoded by the output of RFCI is correct, and in this 
example identical to the ancestral information encoded by the output of FCI. 

Finally, we show that the RFCI-PAG in Figure 3(b) describes two Markov 
equivalence classes. Consider a new DAG Q' , which is adapted from Q in Fig- 
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ure 3(a) by adding one additional latent variable L3 pointing at X\ and A5. 
This modification implies that X\ and A5 are conditionally dependent given 
any subset of the remaining observed variables, so that Q' belongs to a dif- 
ferent Markov equivalence class than Q. The output of both FCI and RFCI, 
when using as input the conditional independence relationships that can be 
read off from Q' , is given in Figure 3(b). Hence, the PAG in Figure 3(b) rep- 
resents more than one Markov equivalence class if interpreted as an RFCI- 
PAG. 

3.4. A class of graphs for which the outputs of FCI and RFCI are identi- 
cal. We now specify graphical conditions on an underlying DAG Q = (V, E) 
with V = X U L U S such that the outputs of FCI and RFCI are identical 
(Theorem 3.3). Moreover, if the outputs of RFCI and FCI are not identical, 
we infer properties of edges that are present in the output of RFCI but not 
in that of FCI (Theorem 3.4). 

The results in this section rely on the concept of inducing paths [21, 22], 
which we have extended here: 

Definition 3.5. Let Q = (V,E) be a DAG with V = XULUS and 
let Y be a subset of X containing Xi and Xj with Xi 7^ Xj . A path n 
between Xi and Xj is called an inducing path relative to Y given S if and 
only if every member of Y U S that is a nonendpoint on tt is a collider on ir 
and every collider on tt has a descendant in {Xi, Xj} U S. 

We note that our Definition 3.5 corresponds to the one in [21] if Y = X. 
The existence of an inducing path in a DAG is related to (i-connection in the 
following way. There is an inducing path between Xi and Xj relative to Y 
given S if and only if Xi and Xj are not d-separated by (Y' U S) \ {Xi, Xj} 
for all Y'CY (see [21], Lemma 9, page 243). The definition of an inducing 
path is monotone in the following sense: if Yi C Y2 C X and there is an 
inducing path between Xi and Xj relative to Y2 given S, then there also is 
an inducing path between Xi and Xj relative to Yi given S. 

Consider a pair of vertices Xi,Xj £ X in an underlying DAG Q. We in- 
troduce the following shorthand notation. Let Adj(i,j) = adj(Ci,Xj) \ {Xj}, 
where C\ is the initial skeleton after Step 1 of the algorithms. Moreover, let 
Pds(i,j) = pds(C2, Xi, Xj) \ {Xj, Xi}, where C2 is the graph resulting from 
Step 2 of the FCI algorithm. By definition, Pds(k,i) 3 Adj(fc,i) for any pair 
of vertices X^,Xi £ X. We now consider the following three scenarios: 

(51) There is an inducing path between Xi and Xj in Q relative to 
Pds(i,j) given S, and there is an inducing path between Xi and Xj rel- 
ative to Pds(j,i) given S. 

(52) There is an inducing path between Xi and Xj in Q relative to 
Adj(i, j) given S, and there is an inducing path between Xi and Xj relative 
to Adj(j, i) given S. Moreover, there is no inducing path between Xi and Xj 
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in Q relative to Pds(i, j) given S, or there is no inducing path between 
and Xj in Q relative to Pds(j, i) given S. 

(S3) There is no inducing path between Xi and Xj in Q relative to 
Adj(i, j) given S, or there is no inducing path between Xi and Xj in Q 
relative to Adj(j, i) given S. 

We now obtain the following theorem: 

Theorem 3.3. Assume that the distribution o/V = XULuS is faithful 
to an underlying DA G Q . Then the output C of RFCI equals the output C" of 
F CI if for every pair of vertices X{, Xj in X either (SI) holds or (S3) holds. 
IfC'^C", then the skeleton of C is a strict superset of the skeleton of C" , 
and Scenario (S2) must hold for every pair of vertices that are adjacent in C' 
but not in C" . 

Scenario (S2) occurs if and only if (i) there is a path n(i,j) between Xi 
and Xj in the underlying DAG Q that satisfies: (cl) all colliders on ir(i,j) 
have descendants in {Xi, Xj} U S, (c2) every member of Adj(i,j) U S on 
7r(i, j) is a collider on 7r(z, j), (c3) there is a member of (Pds(i, j)UPds(j, i))\ 
Adj(i, j) on vr(i, j) that is not a collider on the path, and (ii) there is a path 
Tr(j,i) between Xj and Xi in the underlying DAG that satisfies conditions 
(cl)-(c3) above with the roles of i and j reversed. In condition (c3), an equiv- 
alent formulation is given by replacing Pds(i, j) UPds(j, i) with X\ {Xi, Xj}. 

To illustrate Theorem 3.3, consider again Example 2 and the graphs in 
Figure 3. The output of RFCI for the underlying DAG shown in Figure 3(a) 
contains an edge between X\ and X5, while the output of FCI does not. 
According to Theorem 3.3, Scenario (S2) must hold for the vertices X± 
and A5. Hence, there must exist paths 7r(l, 5) and 7r(5, 1) between X\ and X§ 
in the underlying DAG that satisfy conditions (cl)-(c3) above. This is indeed 
the case for the path ir = n(l, 5) = 7r(5, 1) = (X\, L\, X2, X3, AT4, L2, X5): (cl) 
there are two colliders on tt, X2 and A4, both with descendants in {Xi, X5}, 
(c2) all members of Adj(l,5) = Adj(5, 1) = {X2,X^} on tt are colliders on 
the path, and (c3) X3 is a member of (Pds(l,5) UPds(5, 1)) \ Adj(l,5) = 
(Pds(5, 1) U Pds(l, 5)) \ Adj(5, 1) on tt and is a noncollider on tt. 

To see that the occurrence of Scenario (S2) does not always lead to a dif- 
ference in the outputs of FCI and RFCI, we revisit Example 1 and the graphs 
in Figure 2. The same path tt as above satisfies conditions (cl)-(c3) in the 
underlying DAG, but the outputs of FCI and RFCI are identical (due to 
the extra tests in Step 2 of Algorithm 3.2). This illustrates that fulfillment 
of (SI) or (S3) for every pair of vertices is not a necessary condition for 
equality of FCI and RFCI. 

Finally, the following theorem establishes features of edges that are present 
in an RFCI-PAG but not in an FCI-PAG. 

Theorem 3.4. Assume that the distribution o/V = XULOS is faithful 
to an underlying DAG Q. If there is an edge Xi**Xj in an RFCI-PAG 
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for Q that is not present in an FCI-PAG for Q, then the following hold: 
(i) Xi ^ an(Q,Xj U S) and Xj ^ &n(Q,Xi U S), and (ii) each edge mark of 
X{*^Xj in the RFCI-PAG is a circle or an arrowhead. 

4. Consistency of FCI and RFCI in sparse high- dimensional settings. 

Let Q = (V,E) be a DAG with V = XULUS and let M be the corre- 
sponding unique MAG over X. We assume that we observe n i.i.d. copies 
of W = {W\, . . . , W p ) ~ (Xi |S, . . . , X p |S). To represent high-dimensional be- 
havior, we let the DAG Q and the number of observed variables p in X 
grow as a function of the sample size, so that p = p n , Q = Qn and M. = A4 n . 
We do not impose any restrictions on the number of latent and selection 
variables. Throughout, we assume that W is multivariate Gaussian, so that 
conditional independence is equivalent to zero partial correlation. 

In Section 4.1, we define the sample versions of RFCI and the different 
versions of FCI. Sections 4.2 and 4.3 contain consistency results for RFCI 
and FCI in sparse high-dimensional settings. The conditions required for 
consistency of RFCI are considerably weaker than those for FCI. 

4.1. Sample versions of RFCI and the different versions of FCI. Let 
Pn-,i,j\Y De the partial correlation between Wi and Wj in W given a set 
Y C W \ {Wi, Wj}, and let p n -i.j\Y be the corresponding sample partial 
correlation. We test if a partial correlation is equal to zero after applying 
Fisher's z-transform defined as g(x) = ^log(j3j). Thus, we consider 

Zn;i,j\Y = 9{f>n;i,j\Y) an d z n;i,j\Y = 9{Pn;i,j\Y) 

and we reject the null-hypothesis Hq(i, j|Y) : PijiY = against the two-sided 
alternative Ha(i, j|Y) : Pij\Y 7^ at significance level a if 

(4.1) \z n , iJlY \ > - a/2)(n - |Y| - 3)" 1 / 2 , 

where $(•) denotes the cumulative distribution function of a standard nor- 
mal random variable. (We assume n > |Y| +3.) 

Sample versions of RFCI and the different versions of FCI can be obtained 
by simply adapting all steps with conditional independence decisions as fol- 
lows: Xi and Xj are judged to be conditionally independent given Y' U S for 
Y' CX\{I i ,I i } if and only if |4 ;i j| Y | < ^"H 1 ~ «/2)(n - |Y| - 3)" 1 / 2 
for Y ~ Y'jS. The parameter a is used for many tests, and plays the role of 
a tuning parameter. 

4.2. Consistency of RFCI. We impose the following assumptions: 

(Al) The distribution of W is faithful to the underlying causal MAG A4 n 
for all n. 

(A2) The number of variables in X, denoted by p n , satisfies p n = 0(n a ) 
for some < a < oo. 
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(A3) The maximum size of the adjacency sets after Step 1 of the oracle 
RFCI algorithm, denoted by q n = maxi<j< Pn (|adj(Ci, Xi)\), where C\ is the 
skeleton after Step 1, satisfies q n = 0(n^ ) for some < b < 1. 

(A4) The distribution of W is multivariate Gaussian. 

(A5) The partial correlations satisfy the following lower and upper bound 
for all Wi,Wj € {W u ...,W Pn } and Y C {W u ...,W Pn } \ {Wi,Wj} with 
|Y|<g n : 

^{\Pn;i,j\y\ : Pn;i,j\Y + 0} > C„, 

SUp{|p„ ; ij| Y | :«^J'} < M < 1, 

where c~ l = 0(n d ) for some < d < 6/2 with b from (A3). 

Assumption (A2) allows the number of variables to grow as any polynomial 
of the sample size, representing a high-dimensional setting. Assumption (A3) 
is a sparseness assumption, and poses a bound on the growth of the max- 
imum size of the adjacency sets in the graph resulting from Step 1 of the 
oracle RFCI algorithm. The upper bound in assumption (A5) excludes se- 
quences of models in which the partial correlations tend to 1, hence avoiding 
identifiability problems. The lower bound in assumption (A5) requires the 
nonzero partial correlations to be outside of the n~ h l 2 range, with b as in 
assumption (A3). This condition is similar to assumption 5 in [12] and con- 
dition (8) in [25]. 

The similarities between our assumptions and the assumptions of [8] for 
consistency of the PC algorithm are evident. The main differences are that 
our assumption (A3) concerns the skeleton after Step 1 of the oracle RFCI 
algorithm instead of the underlying DAG, and that our assumptions (Al) 
and (A4)-(A5) concern the distribution of W instead of X. 

Theorem 4.1. Assume (Al)-(A5). Denote by C n {a n ) the output of the 
sample version of the RFCI algorithm and by C' n the oracle version of the 
RFCI algorithm. Then there exists a sequence a n — > (n — > oo) and a con- 
stant < C < oo such that 

F[C n (a n ) = C' n \ > 1 - 0(exp(-CV- 2d )) 1 as n -> oo, 

where d> is as in (A5). 

One such sequence for a n is a n = 2(1 — <I>(n 1 / 2 c n /2)), where c n is the 
lower bound in (A5) (which depends on the unknown data distribution). 

4.3. Consistency of FCI. Assume (A1)-(A5) of Section 4.2, but replace 
(A3) by (A3'): 

(A3') The maximum size of the Possible-D-SEP sets in Step 3 of the 
oracle FCI algorithm, denoted by r n = maxi<j< p?i (|pds(C2, Xi, •)[), where C2 
is the graph resulting from Step 2, satisfies r n = 0(n l ~ b ) for some < b < 1. 
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Assumption (A3') is stronger than assumption (A3), since the skeleton 
after Step 1 of the RFCI algorithm is identical to the skeleton after Step 2 
of the FCI algorithm, and since the adjacency set is contained in Possible-D- 
SEP by definition. (In fact, one can construct sequences of graphs in which 
the maximum size of the adjacency sets is fixed, but the maximum size of the 
Possible- D-SEP sets grows linearly with the number of vertices.) The stricter 
assumption (A3') is needed for the additional conditional independence tests 
in Step 3 of the FCI algorithm. 

Theorem 4.2. Assume (Al)-(A5) with (A3') instead of (A3). Con- 
sider one of the sample versions of FCI, FCI pait h> CFCI, CFCI pait h, SCFCI 
or SCFCIpath, and denote its output by C*(a n ). Denote the true underly- 
ing FCI-PAG by C n . Then there exists a sequence a n — > (n — > oo) and 
a constant < C < oo such that 

P[C*(a„) = C n ] > 1 - 0(exp(-Cn 1_M )) -> 1 as n ^ oo, 

where d> is as in (A5) . 

As before, one such sequence for a n is a n = 2(1 — $(?i 1 / 2 c n /2)), where c n 
is the lower bound in (A5). 

5. Numerical examples. In this section we compare the performance of 
RFCI and different versions of FCI and Anytime FCI in simulation studies, 
considering both the computing time and the estimation performance. Since 
Anytime FCI requires an additional tuning parameter (see [19] and Section 3 
of [5]), we cannot compare it directly. We therefore define a slight modifica- 
tion, called Adaptive Anytime FCI (AAFCI), where this tuning parameter is 
set adaptively (see Section 3 of [5]). Our proposed modifications of FCI (see 
Section 3.1) can also be applied to AAFCI, leading to the following algo- 
rithms: AAFCIp a th, CAAFCI, CAAFCI path , SCAAFCI and SCAAFCI path . 

The remainder of this section is organized as follows. The simulation 
setup is described in Section 5.1. Section 5.2 shows that the estimation 
performances of RFCI and all versions of FCI and AAFCI are very similar. 
Section 5.3 shows that our adaptations of FCI and AAFCI can reduce the 
computation time significantly for graphs of moderate size, but that RFCI 
is the only feasible algorithm for large graphs. 

5.1. Simulation setup. We use the following procedure to generate a ran- 
dom DAG with a given number of vertices p' and expected neighborhood 
size E(N). First, we generate a random adjacency matrix A with indepen- 
dent realizations of Bernoulli {E(N)/(p' — 1)) random variables in the lower 
triangle of the matrix and zeroes in the remaining entries. Next, we replace 
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the ones in A by independent realizations of a Uniform ([0.1, 1]) random vari- 
able. A nonzero entry Aij can be interpreted as an edge from Xj to Xi with 
"strength" Aij, in the sense that X\, . . . ,X p i can be generated as follows: 
X\ = £\ and Xi = X)r=l Ai r X r + £j for i = 2,... ,p', where e±, . . . ,e p > are 
mutually independent A/"(0, 1) random variables. The variables X%, . . . ,X P > 
then have a multivariate Gaussian distribution with mean zero and covari- 
ance matrix E' = (1 — A)~ l (t — A)~ T , where 1 is the p' x p' identity matrix. 

To assess the impact of latent variables, we randomly define half of the 
variables that have no parents and at least two children to be latent (we do 
not consider selection variables) . We restrict ourselves to variables that have 
no parents and at least two children, since these are particularly difficult for 
RFCI in the sense that they are likely to satisfy Scenario (S2) in Section 3.4. 
Throughout, we let p denote the number of observed variables. 

We consider the oracle versions of RFCI and FCI pa th (note that the out- 
puts of FCIp a th and FCI are identical in the oracle versions), and the sam- 
ple versions of RFCI, (AA)FCI, (AA)FCI path , C(AA)FCI, C(AA)FCI path 
and SC(AA)FCI pa th- In all plots (AA)FCI pat h is abbreviated as (AA)FCIp. 
Let E be the p x p matrix that is obtained from E' by deleting the rows 
and columns that correspond to latent variables. The oracle versions of the 
algorithms use E as input, and the sample versions of the algorithms use 
simulated data from a iV p (0,E) distribution as input. 

The simulations were performed on an AMD Opteron (tm) Quad Core 
Processor 8380 with 2.5 GHz and 2 GB RAM on Linux using R 2.11.0. 

5.2. Estimation performance. We first investigated the difference be- 
tween the oracle versions of RFCI and FCI pa th, using simulation settings 
p' £ {15,20,25} and E(N) = 2. For each combination of these parameters, 
we generated 1000 DAGs, where the average number of observed variables 
was p {14,18,23} (rounded to the nearest integer). For each simulated 
graph, we assessed whether the outputs of FCI pa th and RFCI were differ- 
ent, and if this was the case, we counted the number of additional edges 
in the output of RFCI when compared to that of FCI. For p' = 15, p' = 20 
and p' = 25, there were 0, 1 and 5 of the 1000 DAGs that gave different re- 
sults, and whenever there was a difference, the output of RFCI had a single 
additional edge. Hence, for these simulation settings, the oracle versions of 
FCI pa th and RFCI were almost always identical, and if there was a difference, 
the difference was very small. 

Next, we investigated the performance of the sample versions of RFCI and 
our adaptations of FCI and AAFCI, considering the number of differences 
in the output when compared to the true FCI- PAG. We used two simulation 
settings: small-scale and large-scale. 

The small-scale simulation setting is as follows. For each value of p' € 
{10,15, 20,25,30}, we generated 50 random DAGs with E(N) = 2, where 
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Small-scale: Edges Small-scale: Edge marks 




Fig. 4. Estimation performance of the sample versions of RFCI and the different ver- 
sions of FCI and AAFCI in the small-scale setting, when compared to the true underlying 
FCI-PAG. The simulation settings were E(N) = 2, n = 1000 and a — 0.01. (a) Average 
number of missing or extra edges over 50 replicates; (b) average number of different edge 
marks over 50 replicates. 



the average number of observed variables was p ~ {9, 14, 18, 23, 27}. For each 
such DAG, we generated a data set of size n = 1000 and ran RFCI, (AA)FCI, 
(AA)FCI path , C(AA)FCI and SC(AA)FCI path with tuning parameter a = 
0.01. 

Figure 4 shows the results for the small-scale setting. Figure 4(a) shows 
the average number of missing or extra edges over the 50 replicates, and we 
see that this number was virtually identical for all algorithms. Figure 4(b) 
shows the average number of different edge marks over the 50 replicates. 
We again see that all algorithms performed similarly. We note that the con- 
servative and superconservative adaptations of the algorithms yield slightly 
better edge orientations than the standard versions for larger graphs. 

The large-scale simulation setting is as follows. For each value of p' £ {100, 
200,300,500} we generated 100 random DAGs with E(N) =3, where the 
average number of observed variables was p~ {90,180,271,452}. For each 
DAG, we generated a data set of size n = 1000, and ran RFCI, CFCI pat h 
and CAAFCIp a th [the other versions of (AA)FCI were computationally in- 
feasible] using tuning parameter a = 0.01. To ensure reasonable computing 
times, we terminated an algorithm for a graph if it was not finished after 
eight hours. For CFCI pa th, termination occurred five times for p' = 300 and 
nine times for p' = 500. One of the latter nine graphs also led to termination 
of CAAFCI pa th- To ensure comparability we deleted any run which did not 
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Large-scale: Edges 



Large-scale: Edge marks 
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Fig. 5. Estimation performance of the sample versions of RFCI and the fastest versions 
of FCI and AAFCI in the large-scale setting, when compared to the true underlying FCI- 
PAG. The simulation settings were E(N) = 3, n= 1000 and a = 0.01. (a) Average number 
of missing or extra edges over 91 replicates (see text); (b) average number of different edge 
marks over 91 replicates (see text). 



complete for all algorithms and computed the average number of missing 
or extra edges [see Figure 5(a)] and the average number of different edge 
marks [see Figure 5(b)] over the 91 remaining runs. We again see that all 
algorithms performed similarly. 

5.3. Computing time. We first compared the size of the Possible-D-SEP 
sets in the different versions of FCI, since this is the most important factor for 
the computing time of these algorithms. In particular, if the size of Possible- 
D-SEP is, say, 25 vertices or more, it becomes computationally infeasible 
to consider all of its subsets. For all combinations of p' € {10,50,250} and 
E(N) £ {2,3}, we generated 100 random graphs and ran the oracle version 
of FCI and FCI pat h and the sample versions of FCI, FCI pat h, CFCI and 
CFCIp a th- The average number of observed variables was p ~ {9, 46, 230} for 
E(N) = 2 and p ~ {9, 45, 226} for E(N) = 3. For the sample versions of the 
algorithms we used sample size n = 1000 and tuning parameter a = 0.01. For 
each simulated graph and each algorithm we computed the maximum size 
of the Possible-D-SEP sets over all vertices in the graph. We averaged these 
numbers over the 100 replicates, and denoted the result by mean-max-pds. 
The results are shown in Figure 6. We see that the new definition of pds pat h 
(see Definition 3.4 used in algorithm FCI pat h and CFCI pat h) reduced mean- 
max-pds slightly, while the conservative adaptations of the sample versions 
of the algorithms reduced it drastically. These results are also relevant for the 



26 COLOMBO, MAATHUIS, KALISCH AND RICHARDSON 



Mean-max-pds, E(N)=2 Mean-max-pds, E(N)=3 
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Fig. 6. A plot of mean-max-pds (see text) versus p' , where both axes are drawn in log 
scale. The horizontal line at mean-max-pds = 24 indicates an upper bound that still yields 
a feasible running time of the algorithms. 

different versions of AAFCI, since AAFCI considers all subsets of Possible 
D-SEP up to a certain size. This again becomes infeasible if Possible D-SEP 
is large. 

Next, we investigated the computing time of the sample version of RFCI 
and modifications of FCI and AAFCI under the same simulation settings as 
in Section 5.2. 

Figure 7(a) shows the average running times over the 50 replicates in the 
small-scale setting. We see that RFCI was fastest for all parameter settings, 
while the standard version of FCI was slowest for all settings with p' > 15. 
Our new adaptations of FCI and AAFCI reduced the running time of FCI 
and AAFCI significantly, which is in correspondence with the reduction in 
mean-max-pds that we saw in Figure 6. 

Figure 7(b) shows the average running times over the 91 fastest runs 
in the large-scale setting. We see that RFCI is the only algorithm that is 
computationally feasible for large graphs: for p' = 500 RFCI took about 40 
seconds, while the fastest modifications of FCI took about 10,000 seconds. 
These results can be explained by the fact that Steps 2 and 3 in the RFCI 
algorithm only involve local tests (conditioning on subsets of the adjacency 
set of a vertex), while Step 3 of (AA)FCI considers subsets of the Possible 
D-SEP sets, which can be large even for sparse graphs (see Figure 6). 

6. Discussion. In this paper, we introduce a new algorithm for learning 
PAGs, called the Really Fast Causal Inference (RFCI) algorithm. RFCI uses 
fewer conditional independence tests than the existing FCI algorithm, and 
its tests condition on a smaller number of variables. 
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(a) 



(b) 



Fig. 7. Running time of the sample versions of the algorithms, using simulation settings 
n = 1000 and a = 0.01, where the y-axes are drawn in log scale, (a) Average running time 
in seconds of each algorithm over 50 replicates, using E(N) = 2; (b) average running time 
in seconds of each algorithm over 91 replicates (see text), using E(N) — 3. 



The output of RFCI can be interpreted as the output of FCI, with the only 
difference that the presence of an edge has a weaker meaning. In particular, 
the interpretation of tails and arrowheads is identical for both algorithms. 
In this sense the RFCI algorithm is similar to the Anytime FCI algorithm 
of [19]. 

We describe a class of graphs where the outputs of FCI and RFCI are 
identical, and show that differences between the two algorithms are caused 
by very special structures in the underlying DAG. We confirm this finding 
in simulation studies that show that differences between the oracle versions 
of RFCI and FCI are very rare. 

We prove consistency of FCI and RFCI in sparse high-dimensional set- 
tings. The sparsity conditions needed for consistency of RFCI are consid- 
erably weaker than those needed for FCI, due to the lower computational 
complexity of the RFCI algorithm. 

We compare RFCI with several modifications of (Anytime) FCI in sim- 
ulation studies. We show that all algorithms perform similarly in terms of 
estimation, and that RFCI is the only algorithm that is computationally 
feasible for high-dimensional sparse graphs. 

We envision several possible uses of RFCI. First, it could be used in 
addition to the PC algorithm to assess the potential impact of the existence 
of latent or selection variables. Second, it could be used as a building block 
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for an IDA-like method [10, 11] to obtain bounds on causal effects based on 
observational data that is faithful to an unknown underlying causal graph 
with arbitrarily many latent and selection variables. In order to achieve the 
latter, we plan to build on the work of [17, 23], who made a start with 
the study of causal reasoning for ancestral graphs. Other interesting open 
problems include investigating which RFCI-PAGs can only correspond to 
a single Markov equivalence class, and investigating completeness of the 
RFCI algorithm, that is, investigating whether the edge marks in the output 
of RFCI are maximally informative. 

SUPPLEMENTARY MATERIAL 

Supplement to "Learning high-dimensional directed acyclic graphs with 
latent and selection variables" (DOI: 10. 1214/1 1-AOS940SUPP; .pdf). All 
proofs, a description of the Adaptive Anytime FCI algorithm, pseudocodes, 
and two additional examples can be found in the supplementary docu- 
ment [5]. 



REFERENCES 

Aho, A., Hopcroft, J. and Ullman, J. D. (1974). The Design and Analysis of 

Computer Algorithms. Addison- Wesley, Boston, MA. 
Ali, R. A., Richardson, T. S. and Spirtes, P. (2009). Markov equivalence for 

ancestral graphs. Ann. Statist. 37 2808-2837. MR2541448 
Andersson, S. A., Madigan, D. and Perlman, M. D. (1997). A characterization 

of Markov equivalence classes for acyclic digraphs. Ann. Statist. 25 505-541. 

MR1439312 

Chickering, D. M. (2002). Learning equivalence classes of Bayesian- network struc- 
tures. J. Much. Learn. Res. 2 445-498. MR1929415 

Colombo, D., Maathuis, M. H., Kalisch, M. and Richardson, T. S. (2012). 
Supplement to "Learning high-dimensional directed acyclic graphs with latent 
and selection variables." DOL10.1214/11-AOS940SUPP. 

Cooper, G. (1995). Causal discovery from data in the presence of selection bias. In 
Preliminary Papers of the Fifth International Workshop on Artificial Intelligence 
and Statistics (D. Fisher, ed.) 140-150. 

Dawid, A. P. (1980). Conditional independence for statistical operations. Ann. 
Statist. 8 598-617. MR0568723 

Kalisch, M. and Buhlmann, P. (2007). Estimating high-dimensional directed 
acyclic graphs with the PC-algorithm. J. Mach. Learn. Res. 8 613-636. 

Kalisch, M., Machler, M., Colombo, D., Maathuis, M. H. and Buhlmann, P. 
(2012). Causal inference using graphical models with the R package pcalg. 
J. Statist. Software. To appear. 

Maathuis, M. H., Colombo, D., Kalisch, M. and Buhlmann, P. (2010). Predict- 
ing causal effects in large-scale systems from observational data. Nat. Methods 
7 247-248. 

Maathuis, M. H., Kalisch, M. and Buhlmann, P. (2009). Estimating high- 
dimensional intervention effects from observational data. Ann. Statist. 37 3133- 
3164. MR2549555 



LEARNING DAGS WITH LATENT AND SELECTION VARIABLES 



29 



[12] Meinshausen, N. and Buhlmann, P. (2006). High-dimensional graphs and variable 
selection with the lasso. Ann. Statist. 34 1436-1462. MR2278363 

[13] Pearl, J. (2000). Causality. Models, Reasoning, and Inference. Cambridge Univ. 
Press, Cambridge. MR1744773 

[14] Pearl, J. (2009). Causal inference in statistics: An overview. Stat. Surv. 3 96-146. 
MR2545291 

[15] Ramsey, J., Zhang, J. and Spirtes, P. (2006). Adjacency-faithfulness and con- 
servative causal inference. In Proceedings of the 22nd Annual Conference on 
Uncertainty in Artificial Intelligence. AUAI Press, Arlington, VA. 

[16] Richardson, T. and Spirtes, P. (2002). Ancestral graph Markov models. Ann. 
Statist. 30 962-1030. MR1926166 

[17] Richardson, T. S. and Spirtes, P. (2003). Causal inference via ancestral graph 
models. In Highly Structured Stochastic Systems. Oxford Statistical Science Se- 
ries 27 83-113. Oxford Univ. Press, Oxford. MR2082407 

[18] Robins, J. M., Hernan, M. A. and Brumback, B. (2000). Marginal structural 
models and causal inference in epidemiology. Epidemiology 11 550-560. 

[19] Spirtes, P. (2001). An anytime algorithm for causal inference. In Proc. of the Eighth 
International Workshop on Artificial Intelligence and Statistics 213-221. Morgan 
Kaufmann, San Francisco. 

[20] Spirtes, P., Glymour, C. and Scheines, R. (2000). Causation, Prediction, and 
Search, 2nd ed. MIT Press, Cambridge, MA. MR1815675 

[21] Spirtes, P., Meek, C. and Richardson, T. (1999). An algorithm for causal in- 
ference in the presence of latent variables and selection bias. In Computation, 
Causation, and Discovery 211-252. AAAI Press, Menlo Park, CA. MR1689972 

[22] Verma, T. and Pearl, J. (1990). Equivalence and synthesis of causal models. In 
Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelli- 
gence 255-270. Elsevier, New York. 

[23] Zhang, J. (2008). Causal reasoning with ancestral graphs. J. Mach. Learn. Res. 9 
1437-1474. MR2426048 

[24] Zhang, J. (2008). On the completeness of orientation rules for causal discovery in 
the presence of latent confounders and selection bias. Artificial Intelligence 172 
1873-1896. MR2459793 

[25] Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Mach. 
Learn. Res. 7 2541-2563. MR2274449 



D. Colombo 

M. H. Maathuis 

M. Kalisch 

ETH Zurich 

Seminar for Statistics 

Ramistrasse 101 

8092 Zurich 

Switzerland 

E-mail : colombo@stat . math.ethz . ch 
maathuis@stat . math.ethz . ch 
kalischOstat . math.ethz . ch 



T. S. Richardson 
Department of Statistics 
University of Washington 
Seattle, Washington 98195 
USA 

E-MAIL: thomasrOu. washington.edu 



